This paper provides answers to questions regarding the almost 
sure limiting behavior of rooted, binary tree-structured rules for re- 
gression. Examples show that questions raised by Gordon and Olshen 
in 1984 have negative answers. For these examples of regression func- 
tions and sequences of their associated binary tree-structured approx- 
imations, for all regression functions except those in a set of the first 
category, almost sure consistency fails dramatically on events of full 
probability. One consequence is that almost sure consistency of bi- 
nary tree-structured rules such as CART requires conditions beyond 
requiring that (1) the regression function be in $\mathcal{L}^1$, (2) partitions of 
a Euclidean feature space be into polytopes with sides parallel to 
coordinate axes, (3) the mesh of the partitions becomes arbitrarily 
fine almost surely and (4) the empirical learning sample content of 
each polytope be "large enough." The material in this paper includes 
the solution to a problem raised by Dudley in discussions. The main 
results have a corollary regarding the lack of almost sure consistency 
of certain Bayes-risk consistent rules for classification. 

1. Introduction. Rooted, binary tree-structured methods have been im- 
portant modern statistical tools for regression, classification, probability 
class estimation, clustering and survival analysis; see books by Breiman, 
Friedman, Olshen and Stone [1], Gersho and Gray [7], Devroye, Gyorfi and 
Lugosi [4], Ripley [12], Zhang and Singer [15], Hastie, Tibshirani and Fried- 
man [9] and their references. These books include algorithms, wide ranging 
applications, and theory. The last has involved an "empirical Lebesgue
integral" (an expression first used by Peter Huber), along with connections to 
the asymptotically minimax approximation of functions (see [6]), and has 
motivated improvements to the celebrated large deviation result of Vapnik 
and Chervonenkis; see [10, 11]. To put this paper in context, see [5, 8]. 

My primary goal is to answer in the negative questions raised by Gordon 
and Olshen [8] regarding the almost sure limiting behavior of rooted, binary 
tree-structured rules for regression. There is also a solution to a problem posed 
by Dudley in discussions. The arguments regarding regression can be applied 
to obtain a certain negative result concerning classification. The remainder 
of this section is an introduction to terminology and the results in the re- 
mainder of the paper. The first part of Section 2 is a summary of relevant 
results on martingales, on the differentiation of integrals and also on equiv- 
ariance. It is intended to place Theorems 1.2 and 1.3 and Corollary 1.4 in the 
somewhat subtle context of previous work. Readers will see that conclusions 
have two distinct parts, one probabilistic and the other concerning the dif- 
ferentiation of integrals; see (4.1). Lemma 2.1 and Section 3 are expositions 
of the key parts of the counterexamples; they concern the differentiation of 
integrals insofar as it is related to rooted, binary tree-structured statisti- 
cal rules. Section 4 provides details by means of which proofs of the two 
theorems and the corollary are completed. 

As in Breiman, Friedman, Olshen and Stone ([1], Chapter 10), a finite
rooted binary tree is a finite nonempty set $T$ of positive integers together
with (for $t \in T$) two functions, $\mathrm{left}(t)$ and $\mathrm{right}(t)$, that map $T$ to
$T \cup \{0\}$ and which satisfy the following two properties: (1) for each
$t \in T$, either $\mathrm{left}(t) = \mathrm{right}(t) = 0$, or $\mathrm{left}(t) > t$ and
$\mathrm{right}(t) > t$; (2) for each $t \in T$, other than the smallest integer in $T$,
there is exactly one $s \in T$ for which either $t = \mathrm{left}(s)$ or
$t = \mathrm{right}(s)$. The minimum element of $T$ is called the root of $T$.
If $s, t \in T$ and $t = \mathrm{left}(s)$ or $t = \mathrm{right}(s)$, then $s$ is called the
parent of $t$. The root of $T$ has no parent, but every other $t \in T$ has a
unique parent. A $t \in T$ is called a terminal node if it is not a parent, that
is, if $\mathrm{left}(t) = \mathrm{right}(t) = 0$.
A finite partition of a set is called a finite, rooted, binary tree-structured
partition if there exist a finite, rooted, binary tree and a bijection that
associates members of the partition with terminal nodes of the tree. For
each member of any sequence of nested subtrees of $T$ with common root, it
is required that there exist a bijection that associates that subtree with a
corresponding subpartition, where the nesting of partitions and of subtrees
correspond in an obvious way. A real-valued function $h$ on $\Omega$ is a binary
tree-structured function if there is a finite, rooted, binary tree-structured
partition of $\Omega$ and $h$ is constant on each member of the partition. 
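
The two defining properties of a finite rooted binary tree can be checked mechanically. The following is a minimal sketch, not code from the cited book; the function names and the dictionary encoding of left and right are assumptions made purely for illustration.

```python
def is_rooted_binary_tree(nodes, left, right):
    """Check properties (1) and (2) for a candidate tree.

    nodes: iterable of positive integers (the set T).
    left, right: dicts mapping each node to a child label, or 0 for "no child".
    This encoding is an illustrative assumption, not the paper's notation.
    """
    nodes = set(nodes)
    if not nodes or any(not isinstance(t, int) or t <= 0 for t in nodes):
        return False
    parents = {t: set() for t in nodes}
    for t in nodes:
        l, r = left[t], right[t]
        if l == 0 and r == 0:
            continue  # terminal node
        # Property (1): both children lie in T and carry larger labels.
        if l not in nodes or r not in nodes or l <= t or r <= t:
            return False
        parents[l].add(t)
        parents[r].add(t)
    root = min(nodes)
    # Property (2): every node other than the root has exactly one parent.
    return all(len(parents[t]) == 1 for t in nodes if t != root)


def terminal_nodes(nodes, left, right):
    """Return the sorted labels t with left(t) = right(t) = 0."""
    return sorted(t for t in nodes if left[t] == 0 and right[t] == 0)
```

With T = {1, 2, 3, 4, 5}, left = {1: 2, 2: 4} and right = {1: 3, 2: 5} (all other values 0), the check succeeds and the terminal nodes are 3, 4 and 5, so such a tree can index a partition with three members.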

Throughout, $d$-dimensional Euclidean space is denoted by $\mathcal{R}^d$, an
important subset being the open unit cube $\mathcal{U}^d$. Much mathematics concerns the
case $d = 2$. Our principal focus is on rooted, binary tree-structured partitions
of $\mathcal{R}^d$ (or $\mathcal{U}^d$) into boxes. 

A box is a set $\{x\} = B \subset \mathcal{R}^d$ that is the solution set of a system of
inequalities defined by inner products $b^{(k)} \cdot x < c$ or $b^{(k)} \cdot x > c$,
$k = 1, 2, \ldots, K < \infty$, where $\mathcal{R}^d \ni b^{(k)} \ne 0$ and $c$ is real. If for
each linear inequality that defines $B$ exactly one coordinate of $b^{(k)}$ is not 0,
then $B$ is a basic box or, alternatively, an interval. 

Our focus is on rooted, binary tree-structured partitions of $\mathcal{R}^d$
(alternatively, of $\mathcal{U}^d$) into a finite number of basic boxes. There is an obvious
bijection that associates terminal leaves of the tree and basic boxes of the
partition without nonempty subsets that are themselves basic boxes of the
partition. $Q$ is a generic symbol for a finite partition of $\mathcal{R}^d$ (or $\mathcal{U}^d$), all of
whose component subsets are basic boxes. For $x \in \mathcal{R}^d$, denote by $B(x)$ the
unique, smallest basic box in $Q$ that contains $x$. For a sequence of such
partitions $Q^{(N)}$, $B^{(N)}(x)$ has an obvious meaning. Write $B^{(N)}$ for $\{B^{(N)}(x)\}$. 

The diameter of $B^{(N)}(x)$ is defined as 

    $\delta_N(x) = \sup\{\|z - y\| : y, z \in B^{(N)}(x)\}$, 

while the norm of $Q^{(N)}$, $\|Q^{(N)}\|$, is defined as 

    $\|Q^{(N)}\| = \max_x \delta_N(x)$. 

The assumptions entail that writing "max" makes sense because $\{\delta_N(x)\}$ is
finite. 
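
For bounded basic boxes the diameter and the norm are simple to compute. The sketch below is illustrative only: it assumes a basic box is encoded as a list of (lower, upper) pairs, one per coordinate, and the function names are not from the paper.

```python
import math


def box_diameter(box):
    """Euclidean diameter of a bounded basic box [(lo_1, hi_1), ..., (lo_d, hi_d)].

    The supremum of ||z - y|| over the box is the length of its main diagonal.
    """
    return math.sqrt(sum((hi - lo) ** 2 for lo, hi in box))


def partition_norm(partition):
    """Norm of a finite partition: the largest diameter among its boxes."""
    return max(box_diameter(box) for box in partition)
```

For the partition of the unit square into the left half and the two quarters of the right half, the norm is the diagonal of the left half, that is, the square root of 1.25.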

Suppose that $(X, Y), (X_1, Y_1), \ldots, (X_N, Y_N)$ are independent, identically
distributed (i.i.d.) vectors, 

(1.1)   $X \in \mathcal{R}^d$,  $Y \in \mathcal{R}^1$,  $E(|Y|) < \infty$. 

Write 

(1.2)   $h(x) = E(Y \mid X = x)$ 

for the regression of $Y$ on $X$. The test case $X$ and learning sample
$\{(X_i, Y_i) : i = 1, \ldots, N\}$ are given; $h$ is to be estimated. Write 

(1.3)   $\hat h_N = \hat h_N(X) = \hat h_N(X, (X_1, Y_1), \ldots, (X_N, Y_N))$. 

Throughout, $\hat h_N$ is a simple average of those $Y_i$'s, $1 \le i \le N$, for which
$X_i$ lies in the same box of a partition of $\mathcal{R}^d$ as $X$, provided the box has
positive empirical probability. Equalities (1.3) are made precise in what
follows by (1.7) and (1.8). 

For a measurable subset $S \subset \mathcal{R}^d$, define $\mu(S)$ and $F(S)$ as 

(1.4)   $\mu(S) = E(Y I_S(X))$,   $F(S) = E(I_S(X)) = P(X \in S)$. 

$I_S(X) = 1$ if $X \in S$ and is otherwise 0. The reader can check that a version
of $h(x)$ is $\frac{d\mu}{dF}(x)$. For $P(X \in B^{(N)}(x)) > 0$, define 

(1.5)   $h_N(x) = \dfrac{\mu(B^{(N)}(x))}{P(X \in B^{(N)}(x))}$. 

Of course, we do not observe $h_N$ in applications. At each stage $N$ of
sampling, we are given $Q^{(N)}$, a finite, rooted, binary tree-structured partition of
$\mathcal{U}^d$ into basic boxes that depends measurably on $\{(X_n, Y_n) : n = 1, \ldots, N\}$.
Define $\mathcal{F}_N$ by 

(1.6)   $\mathcal{F}_N = \sigma\{(X_n, Y_n) : n = 1, \ldots, N;\ I_B(X) : B \in Q^{(N)}\}$, 

where $\sigma\{\cdot\}$ is the $\sigma$-field generated by the random quantities inside $\{\cdot\}$. 
I quote a lemma that appears as Lemma 3.12 in [3]. 

Lemma 1.1. $E(Y \mid \mathcal{F}_N) = h_N(X)$. 

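Definitions and Notation. For $x \in \mathcal{R}^d$ and $B^{(N)}(x) = B$, write 

(1.7)   $\hat h_N(x) = \dfrac{\hat\mu_N(B)}{\hat F_N(B)}$, 

where 

(1.8)   $\hat\mu_N(B) = \dfrac{1}{N} \sum_{n=1}^{N} Y_n I_B(X_n)$ 

and $\hat F_N(B) = \dfrac{1}{N} \sum_{n=1}^{N} I_B(X_n)$. 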

Note from Lemma 1.1 and (1.7) and (1.8) that $\hat h_N(x)$ bears the same
relationship to $B^{(N)}$ and $\hat F_N$ that $h_N(x)$ does to $B^{(N)}$ and $F$. 
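
The estimator of (1.7) and (1.8) is the within-box sample mean. A minimal sketch follows, under assumptions that are not in the paper: boxes are encoded as lists of (lower, upper) pairs, membership uses the half-open convention lower <= x < upper, and all function names are hypothetical.

```python
def in_box(x, box):
    """True if x lies in the basic box (half-open convention, an assumption)."""
    return all(lo <= xi < hi for xi, (lo, hi) in zip(x, box))


def find_box(x, partition):
    """Return the member of the partition that contains x."""
    for box in partition:
        if in_box(x, box):
            return box
    raise ValueError("x is not covered by the partition")


def h_hat(x, partition, xs, ys):
    """Ratio of empirical mu to empirical F on the box containing x.

    Returns None when the box has zero empirical probability, the case in
    which the estimator is left undefined.
    """
    box = find_box(x, partition)
    n = len(xs)
    mu_hat = sum(y for xi, y in zip(xs, ys) if in_box(xi, box)) / n
    f_hat = sum(1 for xi in xs if in_box(xi, box)) / n
    return None if f_hat == 0 else mu_hat / f_hat
```

The factor 1/N cancels in the ratio, so the value is just the average of the responses whose feature vectors share a box with the test point.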

One can show that $\hat h_N(X) \to h(X)$ in various senses as $N$ grows without
bound; see, for example, [8], Theorem 12.7 of [1] and [4, 5, 10, 11, 14]. A
particularly strong notion of convergence, but one that matters for applications,
is unconditional almost sure convergence, where "unconditional" is meant
with respect to the learning sample and test case. The question arises as to
whether $\hat h_N$ is consistent in this very strong sense. A major point of this
paper is that this strong notion of consistency does not generally hold. 

Theorem 1.2. There exists a sequence $Q^{(N)}$ of finite, rooted, binary
tree-structured partitions of $\mathcal{U}^2$ for which $\|Q^{(N)}\| \to 0$, as well as a set
$\{(X_n, Y_n) : n = 1, 2, \ldots\}$ and an $X$ that satisfy (1.1) where $X$ is uniformly
distributed on $\mathcal{U}^2$, $E(|Y|)$ is finite, $P(\hat h_N(X) - h_N(X) \to 0) = 1$, and yet
$P(h_N(X) - h(X) \to 0) = 0$. Thus, the analogue of "variance" tends to 0
almost surely and the diameters of basic boxes of partitions tend to 0 surely.
However, the analogue of bias almost surely does not tend to 0 and
$P(\hat h_N(X) - h(X) \to 0) = 0$. 

The perverse behavior of $\hat h_N(X)$ in the example of interest here is
summarized in the next theorem. Without loss of generality, we assume that
$Y \ge 0$, so $h(x) \ge 0$. 

Theorem 1.3. With $(X, Y)$ and $\{Q^{(N)}\}$ as in Theorem 1.2 and $Y \ge 0$,
$\{h : E(h(X)) < \infty$ and $\overline{\lim}_N\, \hat h_N(x) < \infty$ for some $x \in \mathcal{U}^2\}$ is of the
first category in $\mathcal{L}^1(\mathcal{U}^2)$. 

The main examples are also relevant to understanding the (two-class)
classification problem. Thus, let $Y = 1$ or 2 with probability 1/2 each.
Scale the nonnegative $h$ of Theorem 1.2 to have integral 1 and thus to
be a probability density on $\mathcal{U}^2$. Suppose that given $Y = 1$, $X$ has density $h$
and given $Y = 2$, $X$ has the uniform distribution on $\mathcal{U}^2$. Given the training
sample described in (1.1), a (measurable) empirical classification rule
$d_N(x) = d_N(\{(X_i, Y_i) : i = 1, \ldots, N\})(x)$ is given. Then $d_N(x)$ takes values
1 or 2 and is a guess of the unknown $Y$ when $X = x$. We lose one dollar
for an incorrect guess, otherwise we lose nothing. It is not difficult to show
that the rule "$d_B(x) = 1$ if $h(x) > 1$, otherwise $d_B(x) = 2$" is a Bayes rule.
Its expected loss is the "Bayes risk" of the Bayes rule $d_B$. In practice, we
would not know $h$, so we could not compute a Bayes rule. 

A sequence $d_N$ of classification rules is said to be Bayes-risk consistent if
the sequence of expected losses converges to the Bayes risk of a Bayes rule
as $N$ increases without bound. 
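
To make the Bayes rule and its risk concrete, here is a small numerical sketch. It is not from the paper: the function names are hypothetical, and the density used in the usage note is an assumed toy example. With equal priors, class 1 density $h$ and class 2 density 1 on the unit square, the error probability of the rule "guess 1 if h(x) > 1" is the integral of min(h, 1)/2, which the code approximates by the midpoint rule.

```python
def bayes_rule(h_value):
    """Guess class 1 when h(x) > 1, otherwise class 2."""
    return 1 if h_value > 1 else 2


def bayes_risk(h, grid=200):
    """Midpoint-rule approximation of the integral of min(h, 1) / 2.

    h is a density on the unit square, called as h(s, t). The grid size is
    an arbitrary illustrative choice.
    """
    step = 1.0 / grid
    total = 0.0
    for i in range(grid):
        s = (i + 0.5) * step
        for j in range(grid):
            t = (j + 0.5) * step
            total += min(h(s, t), 1.0) / 2.0
    return total * step * step
```

For the assumed toy density h(s, t) = 2s, the rule guesses class 1 exactly when s > 1/2, and the risk is (1/4 + 1/2)/2 = 0.375.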

Corollary 1.4. For any $\varepsilon > 0$ and the stated problem of two-class
classification, with the learning sample as in (1.1) and that which precedes it and
with $\{Q^{(N)}\}$ as in Theorem 1.2, there is a sequence of rooted, binary
tree-structured classification rules $d_N$ with the following property: the rules are
Bayes-risk consistent, but $P(d_N(X) \to d_B(X)) < \varepsilon$. As before, $d_B$ is a Bayes
rule for the problem. 

2. Martingales and the differentiation of integrals. The goal of this
section is to lend perspective to the main examples of this paper. 

Given an $\mathcal{L}^1$ function $f$ on a probability space and a monotonic sequence
of sub-$\sigma$-fields of a base $\sigma$-field, the martingale convergence theorem ensures
that the sequence of successive conditional expectations converges almost
surely to the conditional expectation given the "limit" $\sigma$-field. We are
interested in the case where the sequence is monotonically increasing. If $f$
is measurable with respect to the $\sigma$-field ($\mathcal{F}$, say) that is generated by the
sequence, then the limit random variable is $f$ itself, at least up to an $\mathcal{F}$ set
of probability 0. 

With our notions of basic box and $\hat h_N(X)$, provided each hyperplane
determined by the boundary of each $B_m \in Q^{(N)}$ contains at least one $X_n$,
$1 \le n \le N$, then $\hat h_N$ is equivariant to strictly monotonic transformations of
the coordinate axes. Thus, if $T : \mathcal{R}^d \to \mathcal{R}^d$ is of the form
$T(x_1, \ldots, x_d) = (h_1(x_1), \ldots, h_d(x_d))$ with $h_i$ strictly monotone, then $T$ maps
basic boxes to basic boxes; in an abuse of notation, $\hat h_N(T(x)) = \hat h_N(x)$. Note
that nothing is lost if we take the range of $X$ to be $\mathcal{U}^d$. Clearly, mappings such
as $T$ do not preserve ratios of sides of boxes. 
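
The equivariance can be illustrated numerically. The sketch below makes several assumptions that are not in the paper: boxes are lists of (lower, upper) pairs with half-open membership, the coordinate maps are strictly increasing (so endpoints keep their order), and the function names are hypothetical. Transforming the test point, the learning sample and the box by the same coordinate-wise map leaves the within-box average unchanged.

```python
import math


def in_box(x, box):
    """Half-open membership test for a basic box (an assumed convention)."""
    return all(lo <= xi < hi for xi, (lo, hi) in zip(x, box))


def box_average(x, box, xs, ys):
    """Average of the responses whose features fall in the box containing x."""
    if not in_box(x, box):
        raise ValueError("x must lie in the given box")
    inside = [y for xi, y in zip(xs, ys) if in_box(xi, box)]
    return sum(inside) / len(inside) if inside else None


def transform_point(x, maps):
    """Apply one strictly increasing map per coordinate."""
    return tuple(m(xi) for m, xi in zip(maps, x))


def transform_box(box, maps):
    """Image of a basic box under strictly increasing coordinate maps."""
    return [(m(lo), m(hi)) for m, (lo, hi) in zip(maps, box)]
```

Note that the image box generally has a different ratio of side lengths, which is the point of the closing remark of the paragraph above.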

Our application allows restrictions on neighborhood systems that are
different from bounds on the ratios of sides of boxes. For one, rectangular
neighborhoods are always members of finite partitions of $\mathcal{U}^2$, each of which
is a rooted, binary tree-structured partition. The members of a partition are
the atoms of a finite $\sigma$-field of subsets of $\mathcal{U}^2$. Because the finite $\sigma$-fields are
shown not to differentiate every $\mathcal{L}^1$ function, or even "most" such functions,
they cannot be nested. Even so, Gordon and Olshen ([8], Section 6) asked if
the restrictions to the particularly simple probability space and shapes of the
atoms would allow the relaxation of assumptions on nesting and thereby an
extension of the martingale theorem, not to mention extensions of theorems
regarding the almost sure consistency of binary tree-structured algorithms
for regression. 

Definition. For $B \subset \mathcal{R}^n$, let 

    $\delta(B) = \sup\{\|z - y\| : y, z \in B\}$. 

For each $x \in \mathcal{R}^n$, suppose that $\mathcal{B}(x)$ is a collection of bounded Borel sets
with positive Lebesgue measure, that $B \in \mathcal{B}(x)$ implies $x \in B$ and that for
each $x$, there exists a sequence $\{R_k\} = \{R_k(x)\} \subset \mathcal{B}(x)$ for which
$\delta(R_k) \to 0$ as $k \to \infty$. Then 

    $\mathcal{B} = \bigcup_x \mathcal{B}(x)$ is a differentiation basis. 

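Write $f \in \mathcal{L}^1(\mathcal{R}^n)$ if a version of $f$ is Borel and the Lebesgue integral
$\int |f(x)|\,\lambda(dx)$ is finite. If $f \in \mathcal{L}^1(\mathcal{R}^n)$ and $\mathcal{B}$ is a differentiation basis, then
$\mathcal{B}$ differentiates $f$ if for Lebesgue almost every $x$, $x \in B_k$, $k = 1, 2, \ldots$,
$B_k \in \mathcal{B}$ and $\delta(B_k) \to 0$ implies 

    $\lim_{k} (\lambda(B_k))^{-1} \int_{B_k} f(u)\,\lambda(du) = f(x)$; 

$\mathcal{B}$ differentiates $\mathcal{L}^1(\mathcal{R}^n)$ if it differentiates every $f \in \mathcal{L}^1(\mathcal{R}^n)$. 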

Write $\mathcal{B}_1(x)$ for the collection of open, bounded cubes containing $x \in \mathcal{R}^n$
and $\mathcal{B}_1^*(x)$ for the collection of open, bounded cubes centered at $x$. Write
$\mathcal{B}_1 = \bigcup_x \mathcal{B}_1(x)$ and $\mathcal{B}_1^* = \bigcup_x \mathcal{B}_1^*(x)$. Both $\mathcal{B}_1$ and $\mathcal{B}_1^*$ are differentiation bases.
The Lebesgue differentiation theorem says that $\mathcal{B}_1^*$ differentiates $\mathcal{L}^1(\mathcal{R}^n)$.
Also, $\mathcal{B}_1$ differentiates $\mathcal{L}^1(\mathcal{R}^n)$. These conclusions remain true if the definitions
of $\mathcal{B}_1$ and $\mathcal{B}_1^*$ are relaxed to allow their members to be basic boxes instead
of cubic intervals, but with a finite bound on the ratio of dimensions of any
two sides of the boxes. 

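Let $\mathcal{B}_2(x)$ be the set of otherwise unrestricted basic boxes that contain
$x \in \mathcal{R}^n$, and define $\mathcal{B}_2^*(x)$ by analogy. Let $\mathcal{B}_2 = \bigcup_x \mathcal{B}_2(x)$ and
$\mathcal{B}_2^* = \bigcup_x \mathcal{B}_2^*(x)$. Neither $\mathcal{B}_2$ nor even $\mathcal{B}_2^*$ differentiates $\mathcal{L}^1(\mathcal{R}^n)$ or, for
that matter, $\mathcal{L}^1(\mathcal{U}^n)$ (pages 95 and 96 of [8]). If in addition to $f \in \mathcal{L}^1(\mathcal{R}^n)$,
we also have 

    $\int |f(x)|\,(\log^+ |f(x)|)^{n-1}\,\lambda(dx) < \infty$, 

then $\mathcal{B}_2$ differentiates $f$. 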

The author believes the proof of Lemma 2.1 given below to be new. This 
lemma is at the heart of the counterexamples. 

Lemma 2.1. For $N = 3, 4, 5, \ldots$, there is a nonnegative Borel $f_N$ on
$\mathcal{U}^2$ for which: 

(iii) $\int_{B_{2,N}(x)} f_N(u)\,\lambda(du) \ge N \lambda(B_{2,N}(x))$. 

Obviously, $\{B_{2,N}(x)\}$ can here be taken to be open intervals. 

Proof of Lemma 2.1. Let $\varepsilon > 0$ be given. Further, set $\beta_n = 1/(n(\ln n)^2)$,
$\eta_n = (\ln n)^{-1}$ and $\gamma_n = (\ln n)^{1/2}$. Write $S_n$ for
$\{0 < x < \beta_n/\eta_n,\ 0 < y < \eta_n\} \cup \{0 < x < \eta_n,\ 0 < y < \beta_n/\eta_n\}$. The set
$S_n$ is the union of two oblong rectangles, one contiguous to the $x$-axis and one
contiguous to the $y$-axis. For $(x, y) = u \in \mathcal{U}^2$, $g_n = g_n(u)$ is defined to be
$\gamma_n I_{S_n}(u)$. Obviously, $g_n \ge 0$ and $\gamma_n\beta_n < \|g_n\|_1 < 2\gamma_n\beta_n$. Write
$R_n = S_n \cup \{u \in \mathcal{U}^2 : xy < \beta_n,\ (\beta_n/\eta_n) < x < \eta_n,\ (\beta_n/\eta_n) < y < \eta_n\}$.
Thus, $R_n$ is the union of $S_n$ and a set that is bounded by the $x$-axis, the
$y$-axis, $\{x = \eta_n\}$, $\{y = \eta_n\}$ and the hyperbola $\{xy = \beta_n\}$. Furthermore, the
hyperbola has nonempty intersection with $S_n$. Note that for $u \in R_n$, there
exists a basic box $R'_n(u) = R'_n \subset R_n$ bounded by the $x$-axis, the $y$-axis and
with a vertex on the hyperbola $\{xy = \beta_n\}$ such that 

$\{R'_n(u)\}$ can be assumed to consist of only a countable class of open subsets
of $R_n$. Also, 

    $\lambda(R_n) = \beta_n + \beta_n \int_{\beta_n/\eta_n}^{\eta_n} dx/x = \beta_n + \beta_n(2\ln\eta_n - \ln\beta_n)$. 

Therefore, $\sum_n \lambda(R_n) = \infty$. On the other hand, because $\sum_n \gamma_n\beta_n$ converges,
there exists an $N = N(\varepsilon)$ sufficiently large that $\sum_{n \ge N} \gamma_n\beta_n < \varepsilon/2$, so
$\|\sum_{n \ge N} g_n\|_1 < \varepsilon$. 
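
The area of $R_n$ can be verified by slicing along the $x$-axis. The following worked computation is a sketch under the assumption that $\beta_n/\eta_n < \eta_n$: for $0 < x < \beta_n/\eta_n$ the vertical section of $R_n$ has height $\eta_n$, and for $\beta_n/\eta_n \le x < \eta_n$ it has height $\beta_n/x$.

```latex
\begin{align*}
\lambda(R_n)
  &= \frac{\beta_n}{\eta_n}\cdot\eta_n
   + \int_{\beta_n/\eta_n}^{\eta_n} \frac{\beta_n}{x}\,dx \\
  &= \beta_n + \beta_n\bigl(\ln\eta_n - \ln(\beta_n/\eta_n)\bigr) \\
  &= \beta_n + \beta_n\,(2\ln\eta_n - \ln\beta_n).
\end{align*}
```

Since $-\ln\beta_n$ grows like $\ln n$, the area of $R_n$ exceeds $\beta_n$ by a logarithmic factor, which is what lets the areas sum to infinity while the $\mathcal{L}^1$ norms of the $g_n$ remain summable.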

For $n = N, N + 1, \ldots$, on the square with vertices $(\eta_n, \eta_n)$, $(\eta_n, 1 - \eta_n)$,
$(1 - \eta_n, \eta_n)$, $(1 - \eta_n, 1 - \eta_n)$, choose a point $P_n$ uniformly at random so that
$P_N, P_{N+1}, \ldots$ are independent. Place a square with sides $\eta_n$ in the cited
(larger) square so that the center is at $P_n$ and the sides are parallel to
the coordinate axes. Call this random square $(\mathrm{Sq})_n$. Three subsets of $(\mathrm{Sq})_n$
require definition. 

In what follows, two planar sets are homothetic if one is identical to the
other up to a rigid motion of the plane not involving rotation. Denote by
$\mathcal{S}_n$ the subset of $(\mathrm{Sq})_n$ that is homothetic to $S_n$, by $\mathcal{R}_n$ the subset of $(\mathrm{Sq})_n$
that is homothetic to $R_n$ and by $\mathcal{R}'_n$ the subset of $(\mathrm{Sq})_n$ that is a basic
box and is homothetic to the basic box $R'_n$. For $u \in \mathcal{U}^2$ define the (random)
function $h = h(u)$ by $h = \sum_n \gamma_n I_{\mathcal{S}_n}$. Necessarily, $0 \le h$ and $\|h\|_1 < \varepsilon$.
Because $\sum_n \lambda(R_n) = \infty$, manipulation of indicator functions, independence and
monotone convergence guarantee that almost surely 

    $\lambda\bigl(\bigcup_n \mathcal{R}_n\bigr) = 1$. 

Moreover, 

    $\int_{\mathcal{R}'_n} h \big/ \lambda(\mathcal{R}'_n) \ge \tfrac{1}{2}\gamma_n \nearrow \infty$. 

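It now follows from Fubini's theorem that there exists a real-valued function
$g$ on $\mathcal{U}^2$ for which (i) $0 \le g \in \mathcal{L}^1(\mathcal{U}^2)$, (ii) $\|g\|_1 < \varepsilon$ and (iii) for almost all
$u_0 \in \mathcal{U}^2$, there exists a basic box $\mathcal{R}'_n = \mathcal{R}'_n(u_0) \subset \mathcal{U}^2$ with $u_0 \in \mathcal{R}'_n$ and 

(2.1)   $\int_{\mathcal{R}'_n} g \big/ \lambda(\mathcal{R}'_n) > \varepsilon^{-1}$, 

at least for $n$ satisfying $\tfrac{1}{2}\gamma_n > \varepsilon^{-1}$. 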

Finally, one can argue that "almost all $u_0$" can, in fact, be "all." With
the $\mathcal{R}'_n$ open, the set of $u_0$ of full measure for which (2.1) holds is seen to be
open. Its complement is thus a closed set of Lebesgue measure 0. Call it $\mathcal{N}$.
One sees that there exists $G \in \mathcal{L}^1(\mathcal{U}^2)$ that is continuous on $\mathcal{U}^2 \setminus \mathcal{N}$ and that
tends to $\infty$ as its argument tends to $\mathcal{N}$. Without loss of generality, $G \ge 0$.
Now let $f = g + G$ in order to establish Lemma 2.1. □ 

Definition. For $Q$ a finite partition of $\mathcal{U}^2$ into basic boxes $B$ with
$\lambda(B) > 0$ for all $B \in Q$, $f \in \mathcal{L}^1(\mathcal{U}^2)$ and $u \in \mathcal{U}^2$, define 

(2.2)   $E(f \mid Q)(u) = \sum_{B \in Q} I_B(u) \Bigl( \int_B f(x)\,\lambda(dx) \big/ \lambda(B) \Bigr)$. 
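
Definition (2.2) says that the conditional expectation at a point is the average of the function over the box of the partition that contains the point. The sketch below approximates that average with a midpoint rule; the box encoding as ((a, b), (c, d)), the half-open membership test, the grid size and the function names are all illustrative assumptions.

```python
def box_mean(f, box, grid=100):
    """Midpoint-rule approximation of the average of f over a planar box."""
    (a, b), (c, d) = box
    hx, hy = (b - a) / grid, (d - c) / grid
    total = 0.0
    for i in range(grid):
        x = a + (i + 0.5) * hx
        for j in range(grid):
            total += f(x, c + (j + 0.5) * hy)
    # Dividing the Riemann sum by the area leaves the mean of the samples.
    return total / (grid * grid)


def cond_expectation(f, partition, u, grid=100):
    """Approximate E(f | Q)(u): the mean of f over the box that contains u."""
    for box in partition:
        (a, b), (c, d) = box
        if a <= u[0] < b and c <= u[1] < d:
            return box_mean(f, box, grid)
    raise ValueError("u is not covered by the partition")
```

For f(x, y) = x + y and the partition of the unit square into left and right halves, the value at u = (0.25, 0.5) is 0.75 and the value at u = (0.75, 0.5) is 1.25.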

3. Examples. Material in this section expands upon that of Lemma 2.1 
and is at the heart of the proofs of Theorems 1.2 and 1.3 and Corollary 1.4. 
Compare the results here with those of Busemann and Feller [2] and also 
with those of Saks [13]. 

Define $\mathcal{U}^2_N$ to be $\{u \in \mathcal{U}^2, u = (s, t) : N^{-1} \le s, t \le 1 - N^{-1}\}$. Therefore,
the (countable class of) open basic boxes $\{\mathcal{R}'_n(u) : u \in \mathcal{U}^2_N\}$, whose
existence is ensured by Lemma 2.1, is an open cover of $\mathcal{U}^2_N$. Because $\mathcal{U}^2_N$ is
compact in the usual topology, the Heine-Borel theorem guarantees the
existence of a finite subcover of open basic boxes, which we denote by
$\{B_{2,N}(u_j) : j = 1, \ldots, K_N\}$. Now, fix $N$, $j$, $1 \le j \le K_N$, and $B_{2,N}(u_j)$. There
is clearly a rooted, finite, binary tree-structured partition $Q^{(N,j)}$ of $\mathcal{U}^2$
for which $B_{2,N}(u_j) \in Q^{(N,j)}$, $\|Q^{(N,j)}\| < N^{-1}$ and $E(f_N \mid Q^{(N,j)})(u) \ge N$ for
$u \in B_{2,N}(u_j)$. We therefore have the following lemma: 

Lemma 3.1. For $j = 1, \ldots, K_N$, $\|Q^{(N,j)}\| < N^{-1}$ and for all $u \in \mathcal{U}^2_N$,
$\max\{E(f_N \mid Q^{(N,j)})(u) : j = 1, \ldots, K_N\} \ge N$. 

Now define $f = f(u)$ on $\mathcal{U}^2$ by 

(3.1)   $f = \sum_{N=3}^{\infty} f_N$. 

Clearly, $f \in \mathcal{L}^1(\mathcal{U}^2)$. The finite, rooted binary tree-structured partitions of
$\mathcal{U}^2$ that concern us are 

(3.2)   $\ldots, Q^{(N,1)}, Q^{(N,2)}, \ldots, Q^{(N,K_N)}, Q^{(N+1,1)}, \ldots$ 

From Lemma 3.1, it follows that $\|Q^{(N,j)}\| < N^{-1}$. Since $f \ge f_N$
($N = 3, 4, \ldots$), it follows also from that lemma that 

    $\max\{E(f \mid Q^{(N,1)})(u), \ldots, E(f \mid Q^{(N,K_N)})(u)\} \ge N$ 

if $u \in \mathcal{U}^2_N$. Therefore, if we relabel the partitions (3.2) as 

(3.3)   $\ldots, Q^{(n)}, Q^{(n+1)}, \ldots,$ 

then Theorem 3.2 follows. 

Theorem 3.2. With $Q^{(n)}$ as in (3.3), as $n$ grows without bound, $\|Q^{(n)}\| \to 0$,
but $\overline{\lim}_n E(f \mid Q^{(n)})(u) = +\infty$ on $\mathcal{U}^2$. 

We continue now with a corollary that is due to R.M. Dudley and to 
Gordon and Olshen; see ([3], pp. 161-162). 

Corollary 3.3. There exist a probability $\nu$ on $\mathcal{U}^2$ and an open set
$O \subset \mathcal{U}^2$ with the property that 

    $\nu\{\overline{\lim}_n E(I_O \mid Q^{(n)}) > 0\} > \nu(O)$. 

Here, $Q^{(n)}$ is as in (3.3). 

Next, we argue that the analogue of Theorem 1.3 is true in the present
context. First, note that $\{Q^{(n)}\}$ of (3.3) can be defined inductively so that 
for each n^W, \ U^s d{B'^'^\u))\ < 4. We will assume this to be the case. 
Here, for $S \subset \mathcal{U}^2$, $\partial(S)$ denotes its boundary and $|S|$ its cardinality. 

Theorem 3.4. Let $F = \{g \in \mathcal{L}^1(\mathcal{U}^2) : \overline{\lim}_n E(g \mid Q^{(n)})(u) < \infty$, some
$u \in \mathcal{U}^2\}$. Then $F$ is of the first category in $\mathcal{L}^1(\mathcal{U}^2)$. 

Proof. For $k = 1, 2, \ldots$ and $M = 3, 4, \ldots$, let $F_{k,M} = \{f \in \mathcal{L}^1(\mathcal{U}^2)$: for
some $u \in \mathcal{U}^2_M$, $\|B^{(n)}(u)\| < k^{-1}$ implies $E(f \mid Q^{(n)})(u) \le k\}$. Necessarily,
$F = \bigcup_{k,M} F_{k,M}$. Therefore, it is enough to show that each $F_{k,M}$ is of the first
category in $\mathcal{L}^1(\mathcal{U}^2)$. To that end, fix $k, M$ and assume that $\{g_{j,k,M}\}$ (each
in $F_{k,M}$) satisfies $E(|g_{j,k,M} - g|) \to 0$ as $j \to \infty$ for some $g \in \mathcal{L}^1(\mathcal{U}^2)$. By
the definition of $F_{k,M}$, for each $j = 1, 2, \ldots$, there exists $u_{j,k,M} \in \mathcal{U}^2_M$ for
which $\|B^{(n)}(u_{j,k,M})\| < k^{-1}$ implies that $E(g_{j,k,M} \mid Q^{(n)})(u_{j,k,M}) \le k$. $\mathcal{U}^2_M$ is
compact and without loss of generality, we can assume that $u_{j,k,M} \to u_{k,M}$
as $j \to \infty$. For $n = n(u_{k,M})$ sufficiently large, $u_{k,M} \notin \bigcup_{l \ge n} \partial(B^{(l)}(u_{k,M}))$.
For $l$ sufficiently large, $\|B^{(l)}(u_{k,M})\| < k^{-1}$. Eventually, for each fixed $l$,
$u_{j,k,M} \in B^{(l)}(u_{k,M})$. Therefore, $B^{(l)}(u_{k,M}) = B^{(l)}(u_{j,k,M})$ for $j$ sufficiently
large. For such $j$, $E(g_{j,k,M} \mid Q^{(l)})(u_{k,M}) \le k$. Now,
$E(g \mid Q^{(l)})(u_{k,M}) \le |E(g - g_{j,k,M} \mid Q^{(l)})(u_{k,M})| + k \le E(|g - g_{j,k,M}| \mid Q^{(l)})(u_{k,M}) + k$.
By making $l$, then $j$, sufficiently large, we can conclude that $g \in F_{k,M}$, that is,
that $F_{k,M}$ is closed. 

We now show that $F_{k,M}$ contains no open ball in $\mathcal{L}^1(\mathcal{U}^2)$. Let $h \in F_{k,M}$,
$f$ be as in the example and $0 < \alpha < 1$. Because every function in $\mathcal{L}^1(\mathcal{U}^2)$ is the
limit of bounded, continuous functions and $h$ is taken to be in an open
ball in $F_{k,M}$, we can (and do) take $h$ to be bounded and continuous.
Define $g_\alpha = (1 - \alpha)h + \alpha f$. Then $E(|h - g_\alpha|) \to 0$ as $\alpha \to 0$. But for each
fixed $\alpha$, $\overline{\lim}_n E(g_\alpha \mid Q^{(n)}) = \infty$ everywhere on $\mathcal{U}^2$. This completes the proof of
Theorem 3.4. □ 

4. Applications to rooted, binary tree-structured regression and
classification. We return, now, in this last section, to the application of the
results of Section 3 to arguments for Theorems 1.2 and 1.3 and Corollary 1.4.
The point of Theorem 1.2 is that there exists a sequence $Q^{(N)}$ of finite,
rooted, binary tree-structured partitions of the unit cube $\mathcal{U}^2$ in $\mathcal{R}^2$ for which
$\|Q^{(N)}\| \to 0$, as well as a set $\{(X_n, Y_n) : n = 1, 2, \ldots\}$ and an $X$ that satisfy
the assumptions given previously, for which $X$ is uniformly distributed on
$\mathcal{U}^2$ and $E(|Y|)$ is finite, but where $P(\hat h_N(X) \to h(X)) = 0$. Write 

(4.1)   $|\hat h_N(X) - h(X)| \ge |h_N(X) - h(X)| - |\hat h_N(X) - h_N(X)| := II - I$. 

The original question posed by Gordon and Olshen pertained both to
$|\hat h_N(X) - h(X)|$ and to $II$. If the counterexample to the almost everywhere
convergence to 0 of $II$ for an $h$ in $\mathcal{L}^1(\mathcal{U}^2)$ implies the existence of an
analogous counterexample to the almost sure convergence of $|\hat h_N(X) - h(X)|$ to
0 for an $h$ with $E(|h(X)|) < \infty$, then Theorem 1.2 is proved. In fact, $I$ can
converge to 0 almost surely while $II$ does not. 

The asymptotic behavior of $I$ depends on the large deviation behavior of 

    $\sup_{D \in \mathcal{D}} |\hat F_N(D) - F(D)|$, 

where $\mathcal{D}$ is a Vapnik-Chervonenkis class, as is the set of basic boxes, that is,
the set of interval subsets of $\mathcal{U}^2$. For Theorems 1.2 and 1.3 and Corollary 1.4,
we do not need Vapnik-Chervonenkis-like results. It is easy to adapt
Theorem 3.2 so as to preserve the lack of convergence of $II$, while, in fact,
$I$ tends to 0 almost surely. Suppose that $(X, Y)$ and the training sample
$(X_1, Y_1), \ldots, (X_N, Y_N)$ are as in (1.1), with $X$ having a uniform distribution
on $\mathcal{U}^2$. Fix an $n$ and therefore $Q^{(n)}$ as in (3.3) and let $\mathcal{F}_n$ be as in (1.6). For
$i = 1, 2, \ldots$, let $Y_i = f(X_i)$, where $f$ is as in (3.1) and Theorem 3.2. Recall
that $B \in Q^{(n)}$ implies that $\lambda(B) > 0$ and that each $Q^{(n)}$ has only finitely
many members. Therefore, the strong law of large numbers implies that for
any fixed $Q^{(n)}$ and $\varepsilon = \varepsilon_n$, $0 < \varepsilon_n < 1$, there exists an $N(n, \varepsilon_n)$ sufficiently
large that $P\bigl(\bigcup_{N \ge N(n, \varepsilon_n)} \{|\hat h_N(X) - h_N(X)| > \varepsilon_n\}\bigr) < \varepsilon_n$. If $\varepsilon_n$ is the term of
a convergent series, then $N(n, \varepsilon_n)$ can grow sufficiently fast and $n$
sufficiently slowly so that $Q^{(n)}$ applies to learning samples from size $N(n, \varepsilon_n)$ to
$N(n + 1, \varepsilon_{n+1})$. The Borel-Cantelli lemma implies that $I$ tends to 0 almost
surely. That is, the cardinality of the learning sample, $N$, and the $Q^{(n)}$ that
applies need not be related, other than for convenience. It follows that $II$
can fail to converge to 0 on a set of probability 1, while $I$ can converge to 0
with probability 1. When this happens, $|\hat h_N(X) - h(X)|$ fails to converge to 0
on a set of probability 1. This completes the argument for Theorem 1.2. 

A repetition of the argument in the previous paragraph, with Theorem 3.4 
substituted for Theorem 3.2, completes the argument for Theorem 1.3. 

Because $h = h(u)$ can be approximated arbitrarily closely in $\mathcal{L}^1(\mathcal{U}^2)$ by
a continuous function, it is clear that in the extended example of this
section, $\hat h_N(X) - h(X)$ tends to 0 in $\mathcal{L}^1$ of the common probability space on
which random variables $(X, Y)$ and the learning sample are defined. This
observation is analogous to the argument for Proposition 1 of [15]. 

We now turn our attention to a brief discussion of the two-class classifi- 
cation problem and argument for Corollary 1.4. A formulation is given after 
Theorem 1.3 of Section 1. We argue that the rule for two-class classification 
given next is Bayes-risk consistent, but not almost surely consistent. 

With $Q^{(N)}$ as in the arguments for Theorems 1.2 and 1.3, let $d_N(x) = 1$
if 

    $\sum_{B \in Q^{(N)}} \sum_{i=1}^{N} I_{[x \in B,\, X_i \in B,\, Y_i = 1]} > \sum_{B \in Q^{(N)}} \sum_{j=1}^{N} I_{[x \in B,\, X_j \in B,\, Y_j = 2]}$; 

otherwise, $d_N(x) = 2$. It follows from the construction of $\{Q^{(N)}\}$ and
Theorem 12.17 of [1] that $d_N$ is Bayes-risk consistent. From the argument for
Theorem 1.2 in this section, it follows that for any $\varepsilon > 0$, $Q^{(N)}$ and $h$ can
be chosen so that $P(h(X) < 1) > 1 - \varepsilon$, but $P(h(X) < 1;\ d_N \to d_B) = 0$. 
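
The rule $d_N$ is a within-box vote: only the box containing $x$ contributes to either double sum, so the comparison reduces to counting class labels in that box. A minimal sketch follows; the box encoding, the half-open membership convention and the function names are assumptions made for illustration, not part of the paper.

```python
def in_box(x, box):
    """Half-open membership test for a basic box (an assumed convention)."""
    return all(lo <= xi < hi for xi, (lo, hi) in zip(x, box))


def d_N(x, partition, xs, ys):
    """Guess class 1 only when class 1 strictly outnumbers class 2 in x's box."""
    for box in partition:
        if in_box(x, box):
            n1 = sum(1 for xi, y in zip(xs, ys) if y == 1 and in_box(xi, box))
            n2 = sum(1 for xi, y in zip(xs, ys) if y == 2 and in_box(xi, box))
            return 1 if n1 > n2 else 2
    raise ValueError("x is not covered by the partition")
```

Ties, including the case of a box with no learning-sample points, are resolved in favor of class 2, in line with the "otherwise" clause of the definition.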